Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
FIGURE 2.12
(a) We select τ and λ using the 4-bit Q-DETR-R50 on VOC. (b) The mutual information curves of I(X; E) and I(y^GT; E, q) (Eq. 2.27) on the information plane. The red curves represent the teacher model (DETR-R101). The orange, green, red, and purple lines represent the 4-bit baseline, 4-bit baseline + DA, 4-bit baseline + FQM, and 4-bit baseline + DA + FQM (4-bit Q-DETR), respectively.

memory. We use ImageNet ILSVRC12 [123] to pre-train the backbone of the quantized student. The training protocol is the same as that of the employed frameworks [31, 70]. Specifically, we use a batch size of 16. AdamW [164] is used to optimize the Q-DETR, with an initial learning rate of 1e-4. We train the Q-DETR for 300/500 epochs on the VOC/COCO dataset, and the learning rate is multiplied by 0.1 at the 200/400-th epoch, respectively. Following SMCA-DETR, we train the Q-SMCA-DETR for 50 epochs, and the learning rate is multiplied by 0.1 at the 40th epoch on both the VOC and COCO datasets. We utilize a multi-distillation strategy: in the first stage, the encoder and decoder networks are kept real-valued; in the second stage, we train the fully quantized DETR, loading the weights from the first-stage checkpoint. We select the real-valued DETR-R101 (84.5% AP50 on VOC and 43.5% AP on COCO) and SMCA-DETR-R101 (85.3% AP50 on VOC and 44.4% AP on COCO) as teacher networks.
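For concreteness, the schedule and the two-stage checkpoint hand-off described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' code: the helper names are illustrative, the checkpoint file name is a placeholder, and the quantizer-specific parameters are assumed to be newly introduced in stage two (hence the non-strict load).

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

def make_optimizer_and_scheduler(model: nn.Module, dataset: str = "voc"):
    """Schedule from the text: AdamW with initial lr 1e-4; the lr is
    multiplied by 0.1 at epoch 200 (VOC, 300 epochs total) or epoch
    400 (COCO, 500 epochs total)."""
    optimizer = AdamW(model.parameters(), lr=1e-4)
    milestone = 200 if dataset == "voc" else 400
    scheduler = MultiStepLR(optimizer, milestones=[milestone], gamma=0.1)
    return optimizer, scheduler

def load_stage_one(model: nn.Module, path: str = "stage1.pth") -> nn.Module:
    """Two-stage multi-distillation: stage one keeps the encoder and
    decoder real-valued; stage two initializes the fully quantized
    DETR from that checkpoint. strict=False because the stage-two
    model carries quantizer parameters absent from stage one
    (an assumption about the checkpoint layout)."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state, strict=False)
    return model
```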

Hyper-parameter selection. As mentioned, we select the hyper-parameters τ and λ in this part using the 4-bit Q-DETR model. We show the model performance (AP50) under different setups of the hyper-parameters {τ, λ} in Fig. 2.12 (a), where we conduct ablative experiments on the baseline + DA (AP50 = 78.8%). As can be seen, the performance first increases and then decreases as τ grows from left to right. Since τ controls the proportion of selected distillation-desired queries, the fact that full imitation (τ = 0) performs worse than the vanilla baseline with no distillation (τ = 1) shows that query selection is necessary. Q-DETR performs best with τ set to 0.5 or 0.6. Varying λ, we find that {λ, τ} = {2.5, 0.6} boosts the performance of Q-DETR the most, achieving 82.7% AP50 on VOC test2007. Based on the ablative study above, we set the hyper-parameters τ and λ to 0.6 and 2.5, respectively, for the experiments in this paper.
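The roles of τ and λ can be illustrated with a short sketch: τ thresholds which queries are distillation-desired, and λ weights the resulting distillation term against the detection loss. This is a minimal sketch under stated assumptions; in particular, it assumes each decoder query already carries a foreground score in [0, 1] (how that score is obtained is the query-matching scheme of the method and is not reproduced here), and it uses an MSE imitation loss purely for illustration.

```python
import torch
import torch.nn.functional as F

def query_distillation_loss(student_q: torch.Tensor,
                            teacher_q: torch.Tensor,
                            fg_scores: torch.Tensor,
                            tau: float = 0.6,
                            lam: float = 2.5) -> torch.Tensor:
    """Hypothetical threshold-based query selection.

    student_q / teacher_q: (num_queries, dim) decoder query embeddings.
    fg_scores: (num_queries,) per-query foreground scores in [0, 1]
        (an assumption about their range and provenance).
    tau = 0 distills every query (full imitation);
    tau = 1 selects none (no distillation).
    """
    mask = fg_scores > tau  # keep only distillation-desired queries
    if not mask.any():
        return student_q.new_zeros(())  # no selected queries, zero loss
    # lam trades the imitation term off against the detection loss.
    return lam * F.mse_loss(student_q[mask], teacher_q[mask])
```

With this formulation, τ = 0.6 and λ = 2.5 reproduce the setting chosen in the ablation above: only confidently foreground queries are imitated, and the distillation term is weighted 2.5× in the total objective.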

Effectiveness of components. Table 2.2 quantifies the improvement contributed by each component of Q-DETR. As shown there, the quantized DETR baseline suffers a severe performance drop in AP50 (13.6%, 6.5%, and 5.3% at 2/3/4-bit, respectively). DA and FQM each improve performance when used alone, and the two techniques boost performance considerably further when combined. For example, DA improves the 2-bit baseline